Skip to main content

Full text of "Approximate word matches between two random sequences"

See other formats


The Annals of Applied Probability 
2008, Vol. 18, No. 1, 1-21 
DOI: 10.1214/07-AAP452 

(c) Institute of Mathematical Statistics, 2008 



APPROXIMATE WORD MATCHES BETWEEN TWO 
RANDOM SEQUENCES 

By Conrad J. Burden, Miriam R. Kantorovitz^ 
AND Susan R. Wilson 

Australian National University, University of Illinois 
and Australian National University 

Given two sequences over a finite alphabet C, the D2 statistic 
is the number of m-letter word matches between the two sequences. 
This statistic is used in bioinformatics for expressed sequence tag 
database searches. Here we study a generahzation of the D2 statistic 
in the context of DNA sequences, under the assumption of strand 
symmetric BernouUi text. For fc < m, we look at the count of m-letter 
word matches with up to k mismatches. For this statistic, we compute 
the expectation, give upper and lower bounds for the variance and 
prove its distribution is asymptotically normal. 

1. Introduction. Methods for alignment-free sequence comparison are 
among the more recent tools being developed for sequence analysis in biol- 
ogy [16]. A disadvantage in the classical Smith- Waterman local alignment 
algorithm [13], as well as the popular search algorithms such as FASTA and 
BLAST, is that they assume conservation of contiguity between homologous 
segments. In particular, they overlook the occurrence of genetic shuffling 
[18]. Alignment-free sequence comparison methods are used to compensate 
for this problem. 

A natural alignment-free comparison of two sequences is the number of 
m-letter word matches between the sequences. This statistic, called D2, 
can be computed in linear time in the length of the sequences, which is 
also an advantage over the nonlinear local alignment algorithms. D2 is used 
extensively for EST sequence database searches (e.g., [2, 3, 11] and in the 
software package STACK [6]). 

In [10], Lippert, Huang and Waterman started a rigorous study of D2 
using the model of independent letters in DNA sequences. A formula for 



Received May 2006; revised May 2007. 

^Supported by Australian Research Council Discovery Grant DP0559260. 
AMS 2000 subject classifications. 60F17, 92D20. 

Key words and phrases. DNA sequences, sequence comparison, word matches. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Probability, 
2008, Vol. 18, No. 1, 1-21. This reprint differs from the original in pagination and 
typographic detail. 



1 



2 



C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 



the expectation was computed as well as upper and lower bounds for the 
variance. Limiting distributions, as the length of the sequences, n, and the 
size of the word, m, get large, were derived in some cases. The authors used 
Stein-Chen methods [5, 9, 14] to obtain the following results. When the 
underlying distribution of the alphabet is nonuniform, the distribution of 
D2 has normal asymptotic behavior when m/log^n < 1/6. The logarithmic 
base b is defined by 6 = {J2a<^c Ca)~^) where is the probability of a letter 
taking the value a. Following simulations, it was noted in [10] that the bound 
1/6 above is too small. Our simulations in Section 6 suggest that the bound 
should be closer to 2. 

Another asymptotic regime was identified in [10] when m/log^n > 2. In 
this case, the distribution of D2 has compound Poisson asymptotic behav- 
ior. However, as pointed out in [17] and [1], the Poisson approximation is 
meaningful in this region only when E{D2) is not too small. To control this 
degenerate case, one needs to add the linear restriction m = 21og^n + C. 

When the underlying distribution of the alphabet is uniform, it was proved 
in [9] that for m = alogf^n + C with < a < 2, the distribution of D2 is also 
asymptotically normal. 

A natural generalization of the D2 statistic is to count the number of 

(k) 

approximate m-word matches. For k <m, let D2 be the number of m-word 
matches with up to k mismatches between the two sequences. This statistic 
can be expressed in terms of a distance function. One can define the distance 
between two m-words to be the number of mismatches. A /c-neighborhood of 
an m-word w is then all m-words that are at most k distance from w. The 

(k) 

D2 statistic is the number of A;-neighborhood matches of m-words between 
two sequences. 

In [12], Melko and Mushegian studied the A:-distance and fc-neighborhood 
match count between a probe of length m and a random DNA sequence, 
under the assumption that the sequence is strand-symmetric Bernoulli text. 
They gave a formula for the expectation of the /c-distance match count 
and the /c-neighborhood match count. Melko and Mushegian suggested that 
methods of Lippert, Huang and Waterman in [10] could be used to obtain 

(k) 

upper and lower bounds for the variance of D2 and to analyze its asymp- 
totic behavior. 

(k) 

In this paper we study the D2 statistic under the strand-symmetric 
Bernoulli text assumption. We extend the method of [10] to give upper and 
lower bounds for the variance. We analyze the asymptotic behavior of the 
distribution of 1^2'^^ as n and m increase using the method of cumulants [8] 
rather than Stein's method. For D2, the A; = case, this method improves 
the bound on m/log^n obtained in [10] from 1/6 to 1/2. 

The organization of this paper is as follows. In Section 2 we review def- 
initions and introduce notation. In Section 3 we discuss the mean of 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 



3 



Section 4 is devoted to the variance of -Dg ■ Section 5 we prove normal 

(k) 

asymptotic behavior of the distribution of D2 ■ Section 6 contains the results 
of numerical simulations, and a concluding summary is given in Section 7. 
A list of notations is provided at the end of Section 7. 

2. Preliminaries. Let C = {A, G, C, T} with strand-symmetric probabil- 
ity measure ^ = {(,a, £,Gi (,Ci Ct} and perturbation parameter r/. That is, — 1 < 
rj <1 IS the unique number satisfying 



Let A = A1A2 ■ ■ ■ An and B = B1B2 - ■ - be two random sequences of 
length n over the alphabet C. We assume that A and B are Bernoulli 
texts, meaning, the letters (nucleotides) are independent and identically dis- 
tributed (i.i.d.). We note that the assumption of both sequences having the 
same length is not essential for what follows and its main purpose is to 
simplify notation. Our results can be easily adapted to the case when the 
sequences are of different lengths. 

Definition 2.1. Let x and y be two words of length m. We define the 
distance between x and y to be 

5(x, y) = number of character mismatches between x and y. 

For k < m, we say that x is a k-distance match of y if 6{x,y) = k. When 
^(x, y) < k, then x is said to be a k-neighbor of y. 

Following the terminology and notation of [12], we have the following 
definition. 

Definition 2.2. In the above setup, define the perturbed binomial dis- 
tribution with perturbation parameter rj by 

9k{m, 7], c) = h{m, rj, c)uk{m, rj, c), 

where < c,k < m are integers and 



eT = i(i+r?), 



h{m, rj, c 



m—c 



Uk{m,r],c) 



m—k / \ / 
i=0 ^ ^ ^ 



m — k — i 



m — c 



) 



Vk{i,r],c) 




4 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 

For an m-word w with GC-count Cw, h{m, rj, Cw) is Pr(w), the probabihty of 
seeing w. In the definition of Uk{m, rj, c), and in similar situations throughout 
the paper, we fohow a general convention that (^) = if a < or a > n. 

Note that when = 0, the perturbed binomial distribution is the binomial 
distribution with gk{m,0,c) = bk{m, 1/4), where 

As observed in [12], if T is a strand-symmetric Bernoulli text of length 
m and q is a (known) query text (=word) of length m, then the probability 
distribution of the distance (5(T,q) is a perturbed binomial distribution: 

Pr(5(T, q) = fc) = c/fc(m, r], c), 

where c is the GC-count in q and rj is the perturbation parameter of T. Let 

k 

Gk{m,r],c) =J29r{m,r],c) =Pr((5(T,q) <k) 

r=0 

be the cumulative distribution function of the distance. 

2.1. k -neighborhood matches. Let A and B be two DNA sequences of 
length n. Assume the sequences are strand-symmetric Bernoulli text with 
perturbation parameter rj. Let 0<k<m<nhe integers. 

Definition 2.3. Define the statistic 0^2^^ = D{k,m,n) to be the num- 
ber of fc-neighborhood m-word matches between the sequences A and B, 
including overlaps. Note that D2^^ is the D2 statistic of [10]. 

(k) 

The Df' statistic may be computed as follows. 

Notation. For 1 < s <t <n, write A[s, t] for the subsequence AgAg-^i . . . 
At. 

(k) 

Definition 2.4. Let Y-^ be the ^-neighborhood match indicator (start- 
ing) at position (position i in sequence A and j in B). That is, 

y{k) ^ r 1, if d{A[i, i + m- l],B[j, j + m-l])<k, 
\ 0, otherwise. 

Then the d!^'' statistic can be computed via 

(*,i)6/ 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 5 

where the index set / is 

/ = GNxN:l<i<n-m + l, and l<j<n-m + l}. 

For convenience, we write n for n — m + 1. 
(k) 

3. The mean of • In this section we give a general formula for the 

(k) 

mean of in terms of the perturbed binomial distribution and obtain 

estimates for it. The estimates will be used in later sections in order to 

(k) 

prove normal asymptotic behavior of D2 ■ 

(k) 

First we compute the mean of Y^j ': 

^Kf ] = Pr(llf = 1) 

= ^ Pr(5(A[i,z + m - l],vir) < A:)Pr(B[j, j + m - 1] = w) 

= X! Gk{m,r],c^)Vi{Mv), 

where Cw is the GC-count of w. 

From this we get formulas for the expectation of D2 ■ 



71^ 



Pr(w)Gfc(?n,7?,Cw). 



Remark 3.1. When k = 0, we have Gfc(m, 77, Cw) = Pr(w) and 



^(0)1 
■ij 



This agrees with the formula given in Lippert, Huang and Waterman [10], 
E[Yi,]=pl^\ where P2 = Eae£C^ 

Definition 3.2. For t > 1, let 

Remark 3.3. Note that pt = E[{^xY~^]- Hence, by the Cauchy-Schwarz 
inequality, 

P3 >pI, 

where equality holds if and only if the distribution is uniform: = 1/|£|. 



6 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 

3.1. Estimates. The purpose of these estimates is to explain the asymp- 
totic behavior of -Dg*^^ , rather than to provide a computational tool. Hence, 
these estimates are by no means optimal. 

First we estimate the function gi^(^Tn,Tj, c-^) . Without loss of generality, 
assume > 0. From Definition 2.2, we have upper bounds: 



Vk{i,ri,c) < 



Ukim,r],c) < —— J2 



3 + r?\^ 
1 — r] 

"J + ^ \ I ^\ I iTi — c \ ( ^ + ( m 



I — rj J \i J \m — k — i J \1 — i] J \ k 



Remembering that h{m,r],c-w) = Pr(w) and using similar estimates for the 
lower bound we get 

i, h;-— <5'fc("i,r/,Cw) <Pr(w) ' 



K J \1 +riJ \ K J \l — T] 

and 

Pr(w)E(7)(^)'<G.K,,c.) 

(1) 

< Pr(w) 2^ ' 

r=0 



1 — 7] 



Hence, for E[Y^f^] =J2^|=oy^PT{w)Gk{m,r],c^), we have 
Finally, since E[d!^^] =J2fj^^E[Y^J'^], we have 

Remark 3.4. For k = (the exact matches case), (3) gives E[d!^^] = 
n^p™, which agrees with the expectation computed in [10]. 

Remark 3.5. When r] = (the uniform case), P2 = \ and the upper and 
lower bounds in (3) are equal. Hence, 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 



7 



k 

r=0 
k 



3 \ ^' / 1 ^ Tn—r 

r I V 4 j U, 



r=0 

4. The variance. In this section we give a lower and upper bound for 

(k) 

the variance oi D2 ■ The lower bound is used later to prove asymptotic 

(k) 

normality of D2 ■ The upper bound is not optimal, but is comparable with 
that given in [10] for the A; = case. We start this section by stating the 
main results in Propositions 4.1 and 4.2. We then prove several technical 
lemmas and finish with the proofs of Propositions 4.1 and 4.2. 



Proposition 4.1. Var(L»^ ^ >^ = 

l{n, m, k), where 

2k 



■ n 



(2n-4m + 2)(m-l)A;2^ ^ ^r'"^2m^P3 



l + V 



Proposition 4.2. 



Var(D^ < ^ (2ra - 4m + 2)m^''' i^- 
(4) -n2(2n-4m + 2 



.Y3 + r/ 



2k 



2pi 



-7] 



2P3 



1 -P3 



P3 



P2 



To compute the variance oi D2 = J2{i,j)GiYij ; we need to compute the 
covariances CoY{Yff\Y},'^}).Fov this, we use techniques from [17]. To shorten 
the indices' notation, let u = and v = 

In the following definition we use notation and terminology from [17], 
Chapter 11. 

Definition 4.3. Let Ju = {v = -.{i' — i\ <m or \j' -j\ < m}. Then 
Ju is the dependency neighborhood of Yu in the sense that if v ^ Ju, then 

(k) (k) 

Y^' and Y^'^' are independent. The dependency neighborhood can be de- 
composed into two parts, accordion and crabgrass, Ju = where 

Ju={v= eJu:\i'-i\<m and \j' -j\<m} and Jl = Ju\ 



8 



C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 



Let -u G /. When v i Ju, Cov{Y^ \Yr') = 0. To estimate Cov(yr\yrO 
when V & Ju, we look at the two cases: v £ (crabgrass) and v £ (ac- 
cordion). We will see that crabgrasses contribute the dominant term of 

Var(£»f ^) in the cases we are interested in, that is, for m = O(logn). Hence 

for accordions we only give a crude approximation of Cov{Yu^\Yv''^). We 
start by proving the following positivity lemma. 

Lemma 4.4. (i) For v G J^, Y^''^ and Y^''^ are nonnegatively correlated. 
That is, 



(ii) For V in the main diagonal of J^, Cow{YT,Yr')>Q. 

Proof. We will use the following notation. For r > define 

Y^p {r) = the fc-neighbor match indicator between two r- words at (i, j). 
\i[r) = 5[A[i,i + r- 1], j + r - 1]) 

= number of mismatches in an r-word match at {i,j). 
^2('") = (^(^[^' + m — r,i' + m — V\,B[j' + m — r,j' + m — 1]) 
= number of mismatches in an r-word match 
at {i' + m — r,j' + m — r). 

To prove part (i), let u= {i,j), v = {i' G J^. Write t = i' — i and s = j' — j. 
By symmetry, we may assume v is in the first quadrant of J^, that is, 
< t < m — 1, and m< s. We have 



Cov(yW,yW)>o. 



Pr(yW = i,yi'=) = i) 



J2 Pr(Ai(t) = Zi)Pr(A2(t) = /2) 



ll,l2=0 



(5) 



xPr(y(f;|^,))(m-t) = i,y( 



{i'd') 



{rn-t) = 1) 



^ Pr(Ai(t) = /i)Pr(A2(t)=/2) 



;i,«2=o 



X ^ Pr(w)Gfc_i,(m-t,?7,Cw)Gfc_i2("^-*'^.Cw) 



-wG£'"-* 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 9 



J2 Pr(Ai(t) = li)Gk^i, (m - t, rj, Cw) 



L h 



For wG/:"^-*, let 

(6) (w) = E Pr(A(t) = (m - t, r?, Cw) , 

I 

where A(t) is the distance between two random t- words. Then (5) says that 

i?[yWyW] = i<;[(/,(M^))2]. 

Similarly we get 

E[Y:^'^^]E[Y,^'^] = E[ftiW)f. 

Hence, 

(7) Cov(yj'=), y«) = VariftiW)) > 0. 

For part (ii), let u = and v = G J^'s main diagonal, that is, 

V = {i + t,j + 1) with |t| < m — 1. By symmetry, we may assume <t. As 
before, let Ai(t) be the number of mismatches in a t-word match at 
and let A2{t) be the number of mismatches at (i + m,j + m). Then 

£;[yWyW]= ^ Pr(Ai(t) =/i)Pr(A2(t) = ^2) 

X Pr(yJ'=-'i)(m - t) = l,yj'=-'2)(m - t) = 1) 

t 

= E Pr(Ai(t)=/i)Pr(A2(t) = /2) 

^1,^2=0 

X Pr(yJ"^^"^*^-^i'*^-'2})(^_i) = 1) 

and 



E[yP]E[Y^^^]= E Pr(Ai(t) = /i)Pr(A2(t) = /2) 

/l,«2=0 



X Pr(yJ'=-'i)(7n -t) = l)Pr(yJ'^-'2)(m - t) = 1). 



Since 



p^(yJmin{fc-ii,fc-«2})(^ - t) = 1) 

> Pr(yJ'=-'i)(m - t) = 1) Pr(yJ^-'2)(,„ -t) = l), 
we have that Cov(yi''^ , yj''^ ) > 0. □ 



10 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 

Remark 4.5. From (7) we have that for v G J^, Cov(yi''\yi^^) = 
Var(/t(VF)) for appropriate t. When computing /t(w) in (6), it is worth 
noting the following: 

1. Pr(A(t)=0 = ExG£'Pr(x)gKi,r/,Cx)=Exe/:'Pr(x)2ni(t,r?,Cx).R:om Sec- 
tion 3.1 we also have 



Hence 



2. For k — I >m — t and w G /Z"'"*, Gk~i{m — t, r], Cw) = 1- 

Remark 4.6. When k = (exact matches case), and v € with t as 
above, we get: 
For wG/:™-*, 

/t (w) = Pr(A(t) = 0)Go(m - t, r], Cw) = pI Pr(w). 
Since E[Fv{W)] = Ewe£--' PKw)^ = p™"* and E[Pr(VF)2] = 
Ewe£'"-* Pr(w)^ =^3*"*' we have that 

Cov(yW,yW) = Var(A(VF)) =prVar(Pr(w)) =pr(Pr* "P?""*^)- 
This agrees with the computations in [17], Section 11.5.2. 

Remark 4.7. When rj = (uniform case), /t(w) does not depend on w 
and hence Var(/f(l^)) = 0. Therefore, in this case, fort;G J^,Cov(yi'=\yi'=)): 
0. 

Next we look at the following special crabgrass case. 

Lemma 4.8. Let u = and v = {i' G with t = \i' — i\ = m — 1 
or (by symmetry) \j' — j\ = m — 1. Then 

Cov(yW,yW) = [Pr(A(m - 1) = k)]\p,-pl), 

where A(m — 1) is the distance between two random {m — 1) -words. 
Hence, by Remark 4-5, 



m-l\ fS-r]\'']^ 

P2 



k J \l+r] 



r(|-i)<Cov(yi'=),yW) 



< 



m — 1\ f 3 + r] 



k J \l-r] 



.1"(|-). 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 11 

Proof. By (7), 

Cov(yW,yJ^'))=Var(/i(Ty)). 
Let w £ C. Then, using Remark 4.5, 

/^(w) = ^ Pr(A(m - 1) = l)Gk^i{l, ri, c^) 

1=0 

k-l 

= Pr(A(m - 1) = A;)Go(l, V, Cw) + Pr(A(m -l) = l)-l 

1=0 

k-l 

= Pr(A(m - 1) = A;)^w + Pi'(A(m - 1) = 0- 

/=o 

Note that Pr(A(m - 1) = k) and J2'i=o Pr(A(m -l) = l) do not depend on 
w. Hence, 

Var(/t(VF)) = Var ^Pr(A(?n - 1) = k)Cw + J2 ^^(^("1 - 1) = 0^ 

= [Pr(A(m - 1) = k)fYaT{^w)- 
As noted before (Remark 4.6, with m — t = I), Var(^vy) =P3 ~ P2- ^ 

For the accordion case, we use the fohowing crude estimate. 

Lemma 4.9. For u, v G /, Cov(yi^\ yj''^) = 0{p^m''). 

Proof. 



(8) 



|Cov(yj'=),yW)| < Vvar(yi'))Var(yi'^)) 

= Var(yW)<i?[(yW)^]=i?[yi^)] 



< 



r=0 



f?Ec;j(^^) fro- (2 



4.1. Proof of Proposition 4.1. We now prove the lower bound formula 
for \ax{Df'^) stated in Proposition 4.1. 



12 



C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 



Proof of Proposition 4.1. First we split the variance into the con- 
tributions of crabgrasses and accordions: 

varrf))= Y: Cov(yW,yW) 

= E E Cov(yW,yW) + 5: y: Cov(yW,yj'^)). 

Next we look at the crabgrasses: 

> E E 

U=(i,j)€l v={i' ,j')(iJ^ 

\i' —i\=m—l or \j' —j\=m—l 

> E E 

u=(i,j)ei «=(i'j')eJS 

\i' —i\=m—l or \j'—j\=m—l 



Cov(yi'=), yj'^)) by Lemma 4.4 



1 + r/ 



P2 



by Lemma 4.8 



(2n -4m + 2) 



k 



l + V 



2k 



\P2 



Finally we consider the contribution of the accordions to the variance: 



E E Cov(yW,yi'^)) 



<EE lcov(yw,yi^))l 



^ E E ofe 



by Lemma 4.9 



Then 

Var(D5'=)) = E E Cov(yW,yJ'=)) +0(n2m'=+2p™) 



>2n^ (2n-4m + 2) 



m- 1\2 /3-77\2^ 2mYP3 



1 + 7? 



P2 



□ 



4.2. Proof of Proposition 4.2. The first two terms in (4) come from crab- 
grasses. Let u = {i,j) G I, V = {i',j') € with i' = i + t, <t <m — I. We 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 13 



need to bound Cov(yi''\ fJ''^) = Var(/t(P^)): 



E[ft{Wf]= P^(^) 
< J2 Pr(w)= 



'm—l 



^FiiAit) = l)Gk^iim-t,ri,c^) 

=0 

71—1 k—l 



=0 



r=0 



m — t\ fS + rj 
r J [T~^ 



by (1) 



(m-t)'' 
<P3 



{m-ty 



'm—l "1 2 

EPr(A(t)=/) 



E Pr(w)= 



/3 + 7?V 



fel 2 



and similarly, 



m—l 



E Pr(w) E Pr(A(i) = ^^^-/(m - t, r?, Cw) 



> 



> 



m — l fc— i 

5: Pr(w)2^Pr(A(i)=/)E 

n 2 



m-t\ f^-ri 
r ) [l + r] 



m — l 



Y: Pr(w)2^Pr(A(t)=0 
Lwe£™-t /=0 
2 



E 



m-ti2 



Hence, 



\1 — 7] J 

Summing up over all n's and f 's and using 



J2 E = ^^(2^ - 4m + 2) 



2g 



1 -g" 
1-Q 



14 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 

2k 



with q = P3 and p^j respectively, yields 



2k 



3 + 7] 

1 — r] 



2P3 



1 -P3 



P3 



-n^(2n-4m + 2) 
The last term in (4) comes from accordions: 



^ ^ Cov(yW,yJ'=))<5: 5: |Cov(yW,yW)| 



■ n 



'(2m-l)2p-^ 



r=0 



3 + 7? 

1 — r] 

3 + r? 
1 — r] 



from (8) 



5. Asymptotic behavior. We will need the following central limit theo- 
rem of Janson [8] for certain sums of dependent random variables. To state 
it, we first recall the definition of dependency graph. 

Definition 5.1. A graph T is a dependency graph for a family of ran- 
dom variables if the following holds: 

1. There is a one-to-one correspondence between the random variables and 
the vertices of the graph. 

2. If Vi and V2 are two disjoint sets of vertices of F such that there is no 
edge (fi,f2) in F with vi G Vi and V2 € V2, then the corresponding sets 
of random variables are independent. 

Also recall that the maximal degree of a graph is the maximal number of 
edges attached to a single vertex. 

Theorem 5.2 ([8], Theorem 2). Suppose that for each n, {Wni}f=i is 
a family of hounded random variables; \Wni\ < C„ almost surely. Suppose 
further that F„ is a dependency graph for this family and let M„ he the 
maximal degree of F„ [if F„ has no edges, set Mn = !)• Let Sn = Y^f=i Wni 
and = Var(S'„). If there exists an integer t such that 



(9) 
then 



{Nn/Mnf''MnCn/cTn ^0 aS 



n—>-oo, 



{Sn- E{Sn))/ an ^M{0,1) 



as n — oo. 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 15 
Next we state and prove our main theorem. 



Theorem 5.3. Assume that the four letters of the alphabet C are not 
uniformly distributed. That is, the perturbation parameter r] is not zero. Let 

fin = E[d!^^] and Un = V^Varpf ^). 

For m = a log^^^^^ i^) + C with < a < 1/2 and C a constant, and fixed k 
such that < k < m, 



D 



(k) 



M{0, 1) as oo. 



Proof. We apply Theorem 5.2 to the match indicator random variables 

(k) 9 

y> ^ In this case, the dependency graph has n vertices and edges may be 
defined by connecting the vertex {i,j) with (i', j') if |z' — i| < m or \j' — j\ < 
m. Hence, in the notation of Theorem 5.2, A'^^ = n^; C„ = 1; the maximal 
degree of r„ is the size of a dependency neighborhood: 



\Ju\ = (2m - l)(2n - 2m + 1); 



(fc) 



and Sn = D.^ 

Let m = a\ogiip^{n) + C with < a (and k fixed). Then for a < 1, the 
lower bound, i, for cr^ in Proposition 4.1, has the property 



(2n -Am + 2) 



7] 



2k 



P2 



2m P3 



= Cm^"^" + 0(n2-"(log(n))'=+2 
~ 7^3-2a since a < 1. 
Therefore, in condition (9) we have 

iNjMn)^/'MnCn 



1 



+ 0(nV=+2p^^ 



where Ci > is a constant 



(nV((2m - l)(2n - 2m + l)))^/*(2m - l)(2n - 2m + 1) 



< 



n 



(2m-l)(2n-2m + l) 



<7n 

1/t 



(2m-l)(2n-2m + l) 



(10) 



n 



(2n - 4m + 2) 



3 — 7] 
l + T] 



2k 



,2m. I P3_ 

pI 



1 



P2 

+ 0(n2m'=+V2") 



l/2s 



16 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 

(c.logi/p^(n))i-iAni+V* ^ (alogi/p^(n))i'i/* 

~ [n3-2«]l/2 „l/2-l/t-a 

^0 asn^oo, if 1/2 - 1/t - a > 0. 

Thus, for a < 1/2, we can find t large enough such that 1/2 — 1/t — a > 0. 
□ 

In [10], Lippert, Huang and Waterman used a variation on Stein's resuh 
([15], page 110), due to Dembo and Rinott ([7], Theorem 4.2), to prove 

the fohowing result, under the assumption of i.i.d. letters, for the D2 = 
statistic. Let C be an alphabet set of size \C\ > 1 with nonuniform probability 
measure ^. Then for m = alogi/p2{n) + C with < a < 1/6, 

=>J\[0,1) asn^oo. 

Following simulations, it was noted in [10] that the bound 1/6 above is too 
small. Our simulations in Section 6 suggest that the bound should be closer 
to 2. 

By adjusting the proof of Theorem 5.3 to an alphabet set of any size 
\C\ > 1, we can improve the bound on a from 1/6 to 1/2. Thus we have the 
following theorem. 

Theorem 5.4. Let L he an alphabet set of size \C\ > 1 with nonuniform 
probability measure Then for m = alogi/p2{n) + C with <a < 1/2, 

D2{n)- fin d . 

=^Jv[0,l) as n—> 00. 

Proof. From the lower bound for D2 in [10], 



yai:{D2) > n 



(2m - l)(2n - 4m + 2)p^™(p3/p^ - 1) 
+ p^fl±Il^_^2m-l)p- 

V 1 -P2 



~ alogi/p^{n){n^ ^") since (ps/p^ — 1) > by Remark 3.3. 
Hence, in (10) in the proof of Theorem 5.3 we now have 

(iV„./M„)V*M„a 

CTn 

_ (nV((2m - l)(2n - 2m + l)))^/\2m - l){2n - 2m + 1) 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 17 



^2 ^ 1/i 

<(- -— (2m-l)(2n-2m + l) 



(2m-l)(2n-2m + l 
X f n 



(2m - l)(2n - 4m + 2)p^™ ( ^ - 1 

(alogi/p^(n))i-V*ni+V* ^ (alogi/^^ (n)) V^'V* 
[alogi/p^(n)n3-2a]i/2 nVs-i/t-a 

^0 as n ^ cx), if 1/2 - 1/t - a > 0. 
The rest of the proof is the same as the proof of Theorem 5.3. □ 

Remark 5.5. When the underlying distribution is uniform, Yar{D^^) ~ 
■ Hence, this method of proof fails to show normal asymptotic behavior 
in the uniform case. In fact, for A; = 0, the distribution of 02'^ is not normal 
when |£| = 2, m = 1 and n — > 00 (see [10]). 



6. Numerical simulations. We have carried out numerical simulations 
of pairs of randomly generated sequences of length n = 100 x 2*, i = 0, . . . , 4 



with the nonuniform letter distribution iA = = \-, ic = = h- The statis- 



tic D2 was calculated for each sequence pair using an algorithm based on 
that given in [12]. Kolmogorov-Smirnov values [4] for the standardized 
statistic {D^"^ — fin)/crn compared with the standard normal distribution 
for sample sizes of 2500 sequence pairs are shown in Figure 1. Samples of 

(k) 

D2 which are a close approximation to the normal distribution will have 
p- values distributed uniformly on the interval [0, 1], whereas samples which 
are a poor approximation to the normal distribution will have small p- values. 

Entries in the tables in Figure 1 are shaded to indicate proximity of sam- 
ples to the standard normal distribution, with lighter shades signaling a 
better agreement. The white diagonal line in each table is m = 21ogi/p2 ^ ~^ 
const. The numerical evidence strongly suggests that 

— =^N{0,1) as n — i- 00, 

where the limit is taken along any line m = alogi^p^{n) + C with < a < 2 
for fixed k and C. 

7. Conclusions. We have studied the 02^^ statistic as suggested in [12], 
and defined it as the number of m-word matches with up to k mismatches 



18 



C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 




Fig. 1. Kolmogorov-Smirnov p-values for nonuniform d'"2^ with letter distribution 
=Ct = |, = Cg = I compared with a normal distribution. The white diagonal lines 
are m = alog^/p^n+ const., witha = 2 and '^/P2 = l/J2ae{A,c,G,T}^"-^T'- 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 19 



between two sequences of length n, for strand-symmetric Bernoulli texts 
with a nonuniform letter distribution. We have extended methods applied 
in [10] to give upper and lower bounds on the variance, and have also studied 
the asymptotic behavior as the sequence length n and word length m tend 
to infinity for fixed k. 

(k) 

We have proved that the asymptotic distributional behavior of D2 is 
normal as n ^ 00 for a pair of strand-symmetric Bernoulli texts provided 
the limit is taken along any line m = alogi/p^ n C with < a < ^. For 
/c = this result is also shown to hold for a pair of Bernoulli texts with any 
nonuniform letter distribution. This improves the previous bound for the 
A; = case of a < g given in [10]. 

We have also carried out numerical simulations of strand-symmetric texts 
with letter distribution S,a = = \ , ic = = \ - These simulations strongly 
suggest that the optimum restriction on asymptotic normal behavior may be 
as high as a < 2. This is consistent with simulations in [10] and their result 
that the asymptotic distributional behavior of D^^ is a compound Poisson 
distribution for a>2. 

List of notations. 

D2: The number of m-letter word matches between two given sequences. 
D2 ■ The number of m-letter word matches with up to k mismatches (0 < 

k <m) between two given sequences. 
gk{m,r],c): The perturbed binomial distribution (Definition 2.2). Given a 

strand-symmetric Bernoulli text of length m and perturbation parameter 

r], and an m-word with GC-content c, this is the probability distribution 

of the number of character mismatches between the text and the m-word. 
Gfc(m, rj, c): The cumulative distribution function of the perturbed binomial 

distribution gk{m,r],c). 
hk{Tn,rj^c): For a given m-word with GC-content c, the probability that the 

word occurs at a given site in a strand-symmetric Bernoulli string with 

perturbation parameter 77. (See Definition 2.2.) 

Ju'- The dependency neighborhood of Yu^^ , where u= that is, the word 

locations v = {i' such that either the word at i' overlaps the word at i 

in the first sequence, or the word at j' overlaps the word at j in the first 

sequence, or both. (See Definition 4.3.) 
J^: The accordion, that is, the subset of Ju such that both the word at i' 

overlaps the word at i in the first sequence, and the word at j' overlaps 

the word at j in the first sequence. 
J^: The crabgrass, that is, the subset of Ju such that either the word at i' 

overlaps the word at i in the first sequence, or the word at j' overlaps the 

word at j in the first sequence, but not both. 



20 



C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 



k: The number of mismatches. 
m: The word length. 

n: The length of each of the two sequences. 

n: n — m + 1, the number of possible locations of an m-word in a sequence 
of length n. 

Pt- "llaec^a^ where the sum is taken over the alphabet C. 
Mfc(m, 77, c), ry, c): Functions occurring in the definition of gk{m,r],c). 
(See Definition 2.2.) 

Yu''^ or y}-'^ : The indicator random variable for the event that the ?Ti-word 
starting at position i in the first sequence has no more than k mismatches 
with the m-word starting at position j in the second sequence. We use 
the convention u= {i,j),v = throughout. 

77: The perturbation parameter for a strand-symmetric Bernoulli text. (See 
Section 2.) 

S^a- The probability of finding the letter a at a given location in a strand- 
symmetric Bernoulli string. 

REFERENCES 

[1] Barbour, A. and Chryssaphinou, O. (2001). Compound Poisson approximation: 

A user's guide. Ann. Appl. Probab. 11 964-1002. MR1865030 
[2] Burke, J., Davison, D. and Hide, W. (1999). d2 cluster: A validated method for 

clustering EST and full-length cDNA sequences. Genome Res. 9 1135-1142. 
[3] Carpenter, J. E., Christoffels, A., Weinbach, Y. and Hide, W. A. (2002). 

Assessment of the parallelization approach of d2 cluster for high-performance 

sequence clustering. J. Comput. Chem. 23 755-757. 
[4] CONOVER, W. J. (1999). Practical Nonparametric Statistics. Wiley, New York. 
[5] Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Probab. 3 

534-545. MR0428387 

[6] Christoffels, A., van Gelder, A., Greyling, G., Miller, R., Hide, T. and 
Hide, W. (2001). STACK: Sequence tag alignment and consensus knowledge- 
base. Nucleic Acids Res. 29 234-238. 

[7] Dembo, a. and Rinott, Y. (1996). Some examples of normal approximations by 
Stein's method. IMA Vol. Math. Appl. 76 25-44. MR1395606 

[8] Janson, S. (1988). Normal convergence by higher semiinvariants with applications 
to sums of dependent random variables and random graphs. Ann. Probab. 16 
305-312. MR0920273 

[9] Kantorovitz, M. R., Booth, H. S., Burden, C. J. and Wilson, S. R. (2007). 
Asymptotic behavior of fc-word matches between two random sequences. J. Appl. 
Probab. 44 788-805. 

[10] Lippert, R. a., Huang, H. and Waterman, M. S. (2002). Distributional regimes 
for the number of fc-word matches between two random sequences. Proc. Natl. 
Acad. Sci. USA 99 13980-13989. MR1944413 

[11] Miller, R. T., Christoffels, A. G., Gopalakrishnan, C, Burke, J., Ptitsyn, 
A. A., Broveak, T. R. and Hide, W. A. (1999). A comprehensive approach to 
clustering of expressed human gene sequence: The sequence tag alignment and 
consensus knowledge base. Genome Res. 9 1143-1155. 



APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 21 



[12] Melko, O. M. and Mushegian, A. R. (2004). Distribution of words witli a prede- 
fined range of mismatches to a DNA probe in bacterial genomes. Bioinformatics 
20 67-74. 

[13] Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular 
subsequences. J. Mol. Biol. 147 195-197. 

[14] Stein, C. (1972). A bound for the error in the normal approximation to the distribu- 
tion of a sum of dependent random variables. Proceedings of the Sixth Berkeley 
Symposium on Mathematical Statistics and Probability 2 583-602. Univ. Cali- 
fornia Press, Berkeley, CA. MR0402873 

[15] Stein, C. (1986). Approximate Computation of Expectations. IMS, Hayward, CA. 
MR0882007 

[16] ViNGA, S. and Almeida, J. S. (2003). Alignment-free sequence comparison — a re- 
view. Bioinformatics 19 513-523. 

[17] Waterman, M. S. (2000). Introduction to Computational Biology. Chapman and 
Hall/CRC, New York. 

[18] Zhang, Y. X., Perry, K., Vinci, V. A., Powell, K., Stemmer, W. P. and del 
Cardayre, S. B. (2002). Genome shuffling leads to rapid phenotypic improve- 
ment in bacteria. Nature 415 644-646. 



C. J. Burden 

.John Curtin School of Medical Research 

AND Mathematical Sciences Institute 
Australian National University 
Canberra, A.C.T. 0200 
Australia 

E-MAIL: conrad.burden@anu.cdu.au 



M. R. Kantorovitz 
Department of Mathematics 
University of Illinois 
Urbana, Illinois 61801 
USA 

E-MAIL; ruth@math.uiuc.edu 



S. R. Wilson 

Mathematical Sciences Institute 
Australian National University 
Canberra, A.C.T. 0200 
Australia 

E-MAIL: sue.wiIson@anu.edu.au